Better handling of a bilingual collection of texts

Authors

  • Alexandre Bérard
  • Philippe Langlais
Abstract

Statistical machine translation models are trained on parallel corpora, which are collections of translated texts. These texts are usually processed with dedicated tools called “sentence aligners”, which output parallel sentence pairs. However, parallel resources are very scarce for certain languages or domains. Alternative solutions have been proposed that extract parallel sentences from so-called “comparable corpora”, which contain texts in different languages on similar topics. But comparable corpora can contain document pairs with varying degrees of parallelism. For example, in the Wikipedia corpus, many article pairs are actually parallel. We implement a system to extract parallel sentences from comparable corpora and apply it to the Wikipedia corpus. We also propose a method that determines whether two documents are parallel. By comparing sentence aligners with our parallel sentence extraction system, we suggest that extracting the parallel document pairs in a comparable corpus and running a sentence aligner on them might help improve recall.

Similar resources

Intercultural Competence Formation: Teaching Reading of Profession-Related Texts in a Foreign Language to Agricultural Bilingual Students

The paper deals with the features of teaching the reading of profession-related texts in a foreign language to bilingual students at an agricultural higher education institution. The article's purpose was to analyze the technology of intercultural competence formation by means of reading profession-related texts. The method of intercultural competence formation included using the profession-related texts ...

Key-phrase Extraction for Classification

In this paper we consider the problem of extracting key-phrases from a bilingual text collection and using them for text classification. A key-phrase can be defined as a sequence of words of a given size, in a given partial order, that occurs within a sentence. We describe an algorithm for the discovery of key-phrases. Then, a framework for handling multilingual texts / documents is described wh...

A Comparison in Reading Ability and Achievement between Mono-Lingual and Bilingual Fifth Graders

Y. Adib, Ph.D., Z. Sharifi, N. Mahmoodi. To compare the reading ability and academic achievement of Farsi-speaking mono-lingual fifth graders with those of their bilingual Aazari and Kordi counterparts, three samples of 153, 132, and 145 (total 430) such students from three cities o...

Identifying Parallel Documents from a Large Bilingual Collection of Texts: Application to Parallel Article Extraction in Wikipedia

While several recent works dealing with large bilingual collections of texts, e.g. (Smith et al., 2010), seek to extract parallel sentences from comparable corpora, we present PARADOCS, a system designed to recognize pairs of parallel documents in a (large) bilingual collection of texts. We show that this system outperforms a fair baseline (Enright and Kondrak, 2007) in a number of contr...

A Web-Enabled and Speech-Enhanced Parallel Corpus of Greek-Bulgarian Cultural Texts

This paper reports on completed work carried out in the framework of an EU-funded project aimed at (a) developing a bilingual collection of cultural texts in Greek and Bulgarian, (b) creating a number of accompanying resources that will facilitate study of the primary texts across languages, and (c) integrating a system which aims to provide web-enabled and speech-enhanced access to digitized b...

Publication date: 2014